Lessons from FTM: an Experiment in the Design and Implementation of a Low Cost Fault Tolerant System

Authors

  • Gilles Muller
  • Michel Banâtre
  • Mireille Hue
  • Nadine Peyrouze
  • Bruno Rochat

Abstract

This report describes an experiment in the design of a general purpose fault tolerant system, FTM. The main objective of the FTM design was to implement a “low-cost” fault tolerant system that could be used on standard workstations. At the operating system level, our goal was to provide a methodology for the design of modular reliable operating systems, while offering fault tolerance transparency to user applications. In other words, porting an application to FTM should require only recompiling its source code, without modifying it. These objectives were achieved using the Mach micro-kernel and a modular set of reliable servers which implement application checkpoints and provide continuous system functions despite machine crashes. At the architectural level, our approach relies on a high performance stable storage implementation, called Stable Transactional Memory (STM), which can be implemented either in hardware or in software. We first motivate our design choices, then we detail the FTM implementation at both the architectural and operating system levels. We comment on the reasons for the evolution of our stable memory technology from hardware to software. Finally, we present a performance evaluation of the FTM prototype. We conclude with lessons learned and give some assessments.

Key-words: Fault Tolerance, Blocking Consistent Checkpointing, Stable Memory, Modular Operating System, Micro-kernel.

Leçons du projet FTM : une expérimentation dans la conception d’un système tolérant les fautes de faible coût

Résumé : Ce document présente une expérimentation dans la conception d’un système tolérant les fautes à vocation générale, le FTM. Notre motivation principale était la conception d’un système de faible coût pouvant être utilisé sur des stations de travail standard. En ce qui concerne le système d’exploitation, notre objectif était de développer une méthodologie de conception de systèmes d’exploitation fiables offrant la transparence de la tolérance aux fautes aux applications utilisateurs. Autrement dit, le portage d’une application sur FTM ne doit nécessiter que la compilation du logiciel source sans avoir à modifier ce dernier. Nos objectifs ont été atteints en utilisant le micro-noyau Mach et un ensemble modulaire de serveurs fiables qui implémentent les points de reprise des applications et offrent un service système continu, malgré la défaillance d’une machine. Au niveau de l’architecture, notre approche a reposé sur la conception d’une mémoire stable rapide pouvant être mise en œuvre soit par matériel, soit par logiciel. Nous décrivons tout d’abord nos choix de conception, puis nous présentons la mise en œuvre du FTM en ce qui concerne l’architecture et le système d’exploitation. En particulier, nous décrivons l’évolution de la technologie mémoire stable depuis sa mise en œuvre par matériel jusqu’à son implémentation par logiciel. Enfin, nous présentons une évaluation des performances du prototype qui a été réalisé au cours de cette étude. Nous concluons en tirant les leçons de ce projet.

Mots-clés : tolérance aux fautes, points de reprise cohérents, mémoire stable, système d’exploitation modulaire, micro-noyau.

Similar resources

CAFT: Cost-aware and Fault-tolerant routing algorithm in 2D mesh Network-on-Chip

As chip complexity grows and more components must be integrated on a single chip, network-on-chip has become an important infrastructure for on-chip communication and a good alternative to traditional bus-based approaches. As chip density increases, the possibility of failure in the on-chip network increases, and providing correction and fault tol...

Fault-tolerant adder design in quantum-dot cellular automata

Quantum-dot cellular automata (QCA) are an emerging technology and a possible alternative to semiconductor transistor based technologies, offering faster speed, smaller size, and lower power consumption. Previously, adder designs based on conventional designs were examined for implementation with QCA technology. This paper utilizes the QCA characteristics to design a fault-tolerant adder that is more...

Novel Defect Terminology Beside Evaluation And Design Fault Tolerant Logic Gates In Quantum-Dot Cellular Automata

Quantum-dot Cellular Automata (QCA) is one of the important nano-level technologies for implementing both combinational and sequential systems. QCA has the potential to achieve low power dissipation and to operate at high speed, at THz frequencies. However, the high probability of fabrication defects in QCA is a fundamental challenge to using this emerging technology. Because of these vari...

Publication date: 1995